Summary

ESTs, or 'expressed sequence tags', are DNA sequences read from both ends of
expressed gene fragments. The Merck-WashU EST Project and several other
public EST projects are being performed to rapidly discover the complement of
human genes, and make them easily accessible. These ESTs are widely used to
discover novel members of gene families, to map genes to chromosomes as
'sequence-tagged sites' (STSs), and to identify mutations leading to heritable
diseases. Informatic strategies for querying the EST databases are discussed,
as well as the strengths and weaknesses of the EST data. There is a compelling
need to build on the informatic synthesis of human gene data, and to devise
facile methods for determining gene functions.

Accepted 16 October 1996


Introduction 

Our understanding of molecular biology is built upon a traditional
foundation. Start with a defined biological problem,
find the genes that determine the phenotype of interest, and
incorporate techniques from cell biology, biochemistry, 
genetics, etc. to elucidate the biochemical pathway. This 
paradigm has allowed us to answer simple questions in 
man, and complex questions in model eukaryotes and bacteria.
The genome initiative, however, has taught us that we
can now address comprehensive questions about the 
human genome. Which genes are expressed in each cell 
type? Are genes only expressed where they are needed? 
What is the importance of a gene's location and context
within the genome? What are the transcriptional and trans- 
lational elements of each gene? How and why are genes 
alternatively spliced? What is the extent of genetic variation 
between individuals and how does it influence human 
health, disease and behavior? While some of these ques- 
tions await a complete sequence and analysis of the human 
genome, many can be addressed by collectively analyzing
the <5% of the genome that is transcribed and translated
into protein.

Development of ESTs/STSs 

The central dogma of molecular biology holds that genes
encoded in DNA are copied into messenger RNA (mRNA),
which is then translated into functional proteins. Molecular
biologists typically study expressed genes by isolating
mRNAs by means of their 3' poly(A) tails, and copying them
into complementary DNA, or cDNA. These cDNAs represent
fragments of individual genes, which can be 'cloned'
into DNA circles called plasmids, and replicated many times
in E. coli.

In the 1980s the advent of high-throughput automated 
sequencing made it possible randomly to select many cDNA 
clones from plasmid cDNA libraries and to determine the 
DNA sequence of several hundred bases from both ends. 
These short DNA sequences are called 'Expressed
Sequence Tags', or ESTs, and the position of each gene or
other DNA marker on a physical chromosome map is called
a Sequence-Tagged Site, or STS. Since ESTs and STSs
are sequence-based, each is amenable to PCR amplification,
a powerful tool for searching and characterizing genes.
The EST sequence is sufficient to identify known genes,
and to glimpse the biochemical functions of many novel
genes.

As an example, Merck-WashU ESTs representing tumor
suppressor gene p53 are depicted in Fig. 1. In the case of 
p53, the most common mutant gene in neoplasias, 
5'EST:H61357 represents a partial coding region and 
3'EST:H62385 represents a 3' noncoding sequence from 
the same clone. 5'EST:T80132 does not overlap
5'EST:H61357, but is readily assigned to the same gene by
identity in the 3'ESTs from the two clones.
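
The assignment of non-overlapping 5' ESTs to one gene through their shared 3' ESTs can be illustrated with a minimal sketch. It assumes exact, case-insensitive identity between 3' reads, which is a simplification: real EST comparison would rely on sequence alignment that tolerates read errors. The function name `group_clones_by_3prime`, the clone names and the sequences are all hypothetical and are not taken from the p53 data.

```python
from collections import defaultdict


def group_clones_by_3prime(clones):
    """Group cDNA clones whose 3' ESTs are identical.

    `clones` maps a clone name to a (5' EST, 3' EST) pair of strings.
    Returns a sorted list of sorted clone-name lists, one per putative gene.
    Assumption: exact identity stands in for a real alignment step.
    """
    groups = defaultdict(list)
    for name, (_five_prime, three_prime) in clones.items():
        # The 5' reads may not overlap at all; only the 3' read is compared.
        groups[three_prime.upper()].append(name)
    return sorted(sorted(names) for names in groups.values())


# Toy, made-up sequences for illustration only.
clones = {
    "clone_A": ("ATGGAGGAGCCG", "TTTGGGTCTTTGAAC"),
    "clone_B": ("CCTGAGGTTGGC", "tttgggtctttgaac"),
    "clone_C": ("GGCATTCTGGGA", "ACCTTAGGCATTGCA"),
}

gene_groups = group_clones_by_3prime(clones)
print(gene_groups)
```

Here clone_A and clone_B have unrelated 5' reads but fall into the same group because their 3' reads match, which mirrors the reasoning applied to the two clones above.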

History of the EST approach 

Sequencing of randomly selected hepatic cDNAs demonstrated
the utility of DNA sequence-to-gene function relationships
as early as 1983 nj .

The EST approach was described in 1992 almost simultaneously
by Sikela 1 * 1 and Matsubara ,0) , and was pursued on a